---
title: "Jake Gyllenhaal's type of movies"
author: "José Benardi de Souza Nunes"
date: "22/05/2018"
output:
  html_notebook:
    toc: yes
    toc_float: yes
  html_document:
    df_print: paged
    toc: yes
    toc_float: yes
---

<br/><br/>

# Introduction

> Exploratory data analysis of data from [RottenTomatoes](https://www.rottentomatoes.com/) about actor Jake Gyllenhaal. The code used to mine the data analyzed here, and an explanation of how to use it, can be found in [this report's repository](https://github.com/Benardi/agrupamento-filmes/).

* Entries that have no information about box office were ignored.

<br>

***

<br>

```{r echo=FALSE, message=FALSE, warning=FALSE}
library(tidyverse)
library(here)
library(cluster)
library(plotly)
library(ggdendro)

source(here::here("code/lib.R"))
source(here::here("code/plota_solucoes_hclust.R"))

theme_set(theme_report())

knitr::opts_chunk$set(tidy = FALSE,
                      fig.width = 6,
                      fig.height = 5,
                      echo = TRUE)
paleta = c("#404E4D",
           "#92DCE5",
           "#938BA1",
           "#2D3142",
           "#F4743B")
set.seed(101)
```

# Data Overview

```{r message=FALSE, warning=FALSE}
import_data("jake_gyllenhaal") 
filmes <- read_imported_data()
filmes %>% 
    glimpse()
```

## Box Office

* Data refers to revenue collected inside the USA.

```{r}
p <- filmes %>%
    ggplot(aes(x = ano, 
               y = bilheteria,
               text = paste("Movie:",filme,
                            "\nBox Office:",
                            bilheteria,"m",
                            "\nYear:",ano))) + 
    geom_point(size = 4, color = paleta[1]) +
    labs(y = "Box Office (MM)", x = "Year of release")

ggplotly(p, tooltip = "text") %>%
    layout(autosize = F)
```

* Among Jake's movies, one stands apart from the rest in terms of revenue: **"The Day After Tomorrow"**, released in 2004.

* There is a noticeable downward trend in the Box Office of Jake's movies after 2013.

```{r}
filmes %>% 
    ggplot(aes(x = bilheteria)) + 
    geom_histogram(aes(y=(..count..)/sum(..count..)),binwidth = 10, boundary = 0, 
                   fill = "grey", color = "black") + 
    geom_rug(size = .5) +
    scale_x_continuous(breaks=seq(0,200,20)) +
    labs(y = "Relative Frequency", x = "Box Office (MM)")
```

* We see a clear disparity between **"The Day After Tomorrow"** and the rest of the movies.

* No values outside expected domain, e.g. negative values.

```{r}
p <- filmes %>% 
    ggplot(aes(x = "",
               y = bilheteria,
               label = filme,
               text = paste("Movie:",filme,
                            "\nBox Office:",
                            bilheteria,"m"))) + 
    geom_jitter(width = .05, alpha = .3, size = 3) + 
    labs(x = "", y="Box Office (MM)")

ggplotly(p, tooltip="text") %>% 
    layout(autosize = F)
```

* Separating the movies into those whose Box Office is below 50 million and those whose Box Office is above that seems a reasonable approach.

* **"The Day After Tomorrow"** seems to form a group of its own, which would give us 3 groups.
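
The split described above can be sketched with `cut()`. This is only an illustration under stated assumptions: the values are hypothetical toy numbers, the 50 million threshold comes from the text, and the 150 million threshold is an arbitrary cut-off chosen here just to isolate a single very large value.

```{r}
# Hypothetical toy values (in millions of USD), not the notebook's data
toy_box_office <- c(4.2, 30.2, 46.6, 61.0, 90.8, 186.6)

# 50 comes from the text; 150 is an assumed, illustrative cut-off
# used only to separate the single very large value
toy_groups <- cut(toy_box_office,
                  breaks = c(0, 50, 150, Inf),
                  labels = c("below 50", "above 50", "outlier"))

# Number of toy movies in each of the three groups
table(toy_groups)
```

Fixed thresholds like these are easy to read but arbitrary, which is one reason to let a clustering method propose the groups instead.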

## Rating

```{r}
p <- filmes %>% 
    ggplot(aes(x = ano, 
               y = avaliacao,
                text = paste("Movie:",filme,
                            "\nRating:",
                            avaliacao,
                            "\nYear:",ano))) + 
    geom_point(size = 4, color = paleta[1])  +
    scale_y_continuous(limits = c(0, 100)) +
    labs(y = "Rating RT", x = "Year of Release")

ggplotly(p, tooltip = "text") %>%
    layout(autosize = F)
```

* Between 2005 and 2010 Jake took part in a run of movies that did not please the critics.
* There doesn't seem to be a clear trend in ratings across years of release.

```{r}
filmes %>% 
    ggplot(aes(x = avaliacao)) + 
    geom_histogram(aes(y=(..count..)/sum(..count..)),binwidth = 10, boundary = 0, 
                   fill = paleta[3], color = "black") + 
    geom_rug(size = .5) +
    scale_x_continuous(breaks=seq(0,100,10)) +
    labs(y = "Relative Frequency", x = "Rating RT")
```

* It's possible to notice a considerable number of movies with ratings above 80.

* No values outside expected domain, e.g. negative values.

```{r}
p <- filmes %>% 
    ggplot(aes(x = "",
               y = avaliacao,
               text = paste(
                    "Movie:",filme,
                    "\nRating:",avaliacao))) + 
    geom_jitter(width = .05, alpha = .3, size = 3) + 
    labs(x = "", y="Rating RT")

ggplotly(p, tooltip = "text") %>% 
    layout(autosize = F)

```

* Intuitively three groups arise:
    * The movies with ratings above 80
    * The movies with ratings between 55 and 70 
    * The movies with ratings below 55

<br>

***

<br>

# Hierarchical Clustering

<br>
<br>

## One dimension

<br>

### Box Office

```{r}
agrupamento_h = filmes %>% 
    mutate(nome = paste0(filme, " (bil=", bilheteria, ")")) %>% 
    as.data.frame() %>% 
    column_to_rownames("filme") %>% 
    select(bilheteria) %>%
    dist(method = "euclidean") %>% 
    hclust(method = "centroid")

ggdendrogram(agrupamento_h, rotate = T, size = 2, theme_dendro = F) +
    labs(y = "Dissimilarity", x = "", title = "Dendrogram") +
    geom_hline(aes(yintercept = c(20,30), color=c("4 groups","3 groups"))) +
    scale_colour_manual(name="#Groups",
    values=c("#56B4E9", "#FF9999"))
```

* Judging by the dendrogram, a separation into either four or three groups seems appropriate, given that the increase in dissimilarity from 4 to 3 groups doesn't seem to be substantial.
* Cut made for 4 groups
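
The "increase in dissimilarity" when going from k to k-1 groups can also be read off the `height` component of the `hclust` result. The sketch below uses hypothetical toy values and average linkage, both assumptions made for illustration. The chunk above uses centroid linkage, whose merge heights are not guaranteed to increase monotonically, so the same reading is less direct there.

```{r}
# Hypothetical one-dimensional toy values, not the notebook's data
toy_values <- c(a = 1, b = 2, c = 10, d = 11, e = 30, f = 32, g = 100)

# Average linkage is an assumption made for this sketch
toy_hc <- hclust(dist(toy_values), method = "average")

# Height of each merge, from the last merge (2 -> 1 groups) backwards
merge_heights <- rev(toy_hc$height)
names(merge_heights) <- paste(seq_along(merge_heights) + 1, "->",
                              seq_along(merge_heights), "groups")
merge_heights
```

A large gap between two consecutive entries suggests stopping before the more expensive merge.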

```{r}
atribuicoes = get_grupos(agrupamento_h, num_grupos = 1:6)

atribuicoes = atribuicoes %>% 
    left_join(filmes, by = c("label" = "filme"))

atribuicoes %>% 
    ggplot(aes(x = "Movies", y = bilheteria, colour = grupo)) + 
    geom_jitter(width = .02, height = 0, size = 1.6, alpha = .6) + 
    facet_wrap(~ paste(k, " groups")) + 
    scale_color_brewer(palette = "Dark2") +
    labs(y = "Box Office (MM)", x = "", title = "Grouping by Box Office") +
    guides(color=guide_legend(title="group"))
```

* The division into 4 groups seems more appropriate than the division into 3 groups.
    * The cluster of movies at the bottom of the chart seems to require its own group (group 1 in the 4-group division).

```{r}
k_escolhido = 4

m <- list(l = 220)

p <-atribuicoes %>% 
    filter(k == k_escolhido) %>% 
    ggplot(aes(x = reorder(label, bilheteria),
               y = bilheteria,
               colour = grupo,
               text = paste(
                    "Movie:", reorder(label, bilheteria),
                    "\nBox Office:", bilheteria,
                    "\nGroup:", grupo))) + 
    geom_jitter(width = .02, height = 0, size = 3, alpha = .6) + 
    facet_wrap(~ paste(k, " groups")) + 
    scale_color_brewer(palette = "Dark2") + 
    labs(x = "", y = "Box Office (MM)") + 
    guides(color=guide_legend(title="group")) +
    coord_flip()

ggplotly(p,tooltip = "text") %>%
    layout(autosize = F, margin = m)

```

* **The Day After Tomorrow** demanded a group for itself, as expected.

<br>

### Rating 

```{r}
agrupamento_h = filmes %>% 
    mutate(nome = paste0(filme, " (av=", avaliacao, ")")) %>% 
    as.data.frame() %>% 
    column_to_rownames("filme") %>% 
    select(avaliacao) %>%
    dist(method = "euclidean") %>% 
    hclust(method = "ward.D")

ggdendrogram(agrupamento_h, rotate = T, size = 2, theme_dendro = F) +
    labs(y = "Dissimilarity", x = "", title = "Dendrogram") +
    geom_hline(aes(yintercept = 30),color="red")
```

* Judging by the dendrogram, the **division into three groups seems the most appropriate**, given that the increase in dissimilarity becomes substantial when we go from 3 to 2 groups.
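
Drawing a horizontal line across the dendrogram corresponds to cutting the tree at that height. A minimal sketch with hypothetical toy ratings (not the notebook's data) shows that cutting by height with `cutree(h = ...)` and asking directly for a number of groups with `cutree(k = ...)` give the same assignment when the height falls between the relevant merges.

```{r}
# Hypothetical toy ratings forming three well-separated bands
toy_ratings <- c(92, 90, 88, 86, 66, 64, 62, 60, 40, 38, 36, 34)
toy_hc_rating <- hclust(dist(toy_ratings), method = "ward.D")

# Pick a height halfway between the 4 -> 3 and the 3 -> 2 merges
top_heights <- rev(toy_hc_rating$height)
h_cut <- mean(top_heights[2:3])

# Same tree, cut by height and cut by number of groups
groups_by_height <- cutree(toy_hc_rating, h = h_cut)
groups_by_k <- cutree(toy_hc_rating, k = 3)

table(groups_by_height)
```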

```{r}
atribuicoes = get_grupos(agrupamento_h, num_grupos = 1:6)

atribuicoes = atribuicoes %>% 
    left_join(filmes, by = c("label" = "filme"))

atribuicoes %>% 
    ggplot(aes(x = "Movies", y = avaliacao, colour = grupo)) + 
    geom_jitter(width = .02, height = 0, size = 1.6, alpha = .6) + 
    facet_wrap(~ paste(k, " groups")) + 
    scale_color_brewer(palette = "Dark2") +
    guides(color=guide_legend(title="group")) +
    labs(y = "Rating RT", x = "", title = "Grouping by Rating")

```

* Visually, the division into three groups seems appropriate, in agreement with the dendrogram.

```{r}
k_escolhido = 3

m <- list(l = 220)

p <-atribuicoes %>% 
    filter(k == k_escolhido) %>% 
    ggplot(aes(x = reorder(label, avaliacao),
               y = avaliacao,
               colour = grupo,
               text = paste(
                    "Movie:", reorder(label, avaliacao),
                    "\nRating:", avaliacao,
                    "\nGroup:", grupo))) + 
    geom_jitter(width = .02, height = 0, size = 3, alpha = .6) + 
    facet_wrap(~ paste(k, " groups")) + 
    scale_color_brewer(palette = "Dark2") + 
    labs(x = "", y = "Rating RT") + 
    guides(color=guide_legend(title="group")) +
    coord_flip()

ggplotly(p,tooltip = "text") %>%
    layout(autosize = F, margin = m)

```

* Arguably, **Prince of Persia: The Sands of Time** could demand a group of its own.

<br>

## Two dimensions

<br>

### How many groups should we choose? 

<br>

```{r, warning=FALSE}
agrupamento_h_2d = filmes %>%
   mutate(bilheteria = log10(bilheteria)) %>%
   mutate_at(vars("avaliacao", "bilheteria"), funs(scale)) %>%
   column_to_rownames("filme") %>%
   select("avaliacao", "bilheteria") %>%
   dist(method = "euclidean") %>%
   hclust(method = "ward.D")

ggdendrogram(agrupamento_h_2d, rotate = TRUE, theme_dendro = F) +
    labs(y = "Dissimilarity", x = "", title = "Dendrogram") +
    geom_hline(aes(yintercept = 4),color="red")

```

* Going from 4 to 3 groups adds little dissimilarity.
* Going from 3 to 2 groups adds a relatively substantial amount of dissimilarity, so, judging by the dendrogram alone, anywhere from 6 down to 3 groups seems a reasonable choice.
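
The clustering chunk above applies `log10` to box office and then `scale` to both variables before computing distances. The sketch below, on hypothetical toy values, illustrates what that preprocessing does: after it, both columns are centred at zero with unit standard deviation, so neither variable dominates the Euclidean distance simply because of its units.

```{r}
# Hypothetical toy values, not the notebook's data
toy_movies <- data.frame(
    rating = c(92, 67, 35, 44),
    box_office = c(4.2, 30.2, 90.8, 186.6)
)

# log10 compresses the long right tail of box office;
# scale() then centres each column and divides by its standard deviation
toy_scaled <- scale(transform(toy_movies, box_office = log10(box_office)))

round(colMeans(toy_scaled), 10)   # both columns centred at 0
apply(toy_scaled, 2, sd)          # both columns with unit spread

# Distances now weigh rating and (log) box office comparably
dist(toy_scaled, method = "euclidean")
```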

```{r}
filmes2 <- filmes %>%
    mutate(bilheteria = log10(bilheteria))

plota_hclusts_2d(agrupamento_h_2d,
                filmes2,
                c("avaliacao", "bilheteria"),
                linkage_method = "ward.D", 
                ks = 1:6,
                palette = "Dark2") + 
    facet_wrap(~ paste(k, " groups")) +
    scale_y_log10() +
    guides(color=guide_legend(title="group")) +
    labs(y = "Box Office", x = "Rating", title = "Grouping with two dimensions")
```

* The choice of 5 groups seems appropriate, as it reflects differences in both Box Office and Rating. **We'll choose 5 groups** for the following reasons (the groups mentioned are those of the 5-group division): 
    * The $\color{magenta}{\text{4 best rated movies}}$ are very close to each other and suggest a group.
    * The $\color{#7C3F7C}{\text{3 movies of small Box Office and low ratings}}$ are very dissimilar from the rest of the movies and suggest a group.
    * The $\color{#16A085}{\text{4 movies of small Box Office and high ratings}}$ are close to each other and suggest a group.
    * The $\color{green}{\text{2 movies of huge Box Office and very low ratings}}$ are very far from the rest of the movies and suggest a group.
    * The $\color{#CF5300}{\text{6 central/median movies in terms of Box Office/Rating}}$ are very close to each other and suggest a group.
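
Descriptions like the ones above (group size, typical rating, typical box office) can be checked with a small per-group summary. This is a sketch under stated assumptions: the values are hypothetical toy numbers, `k = 3` is arbitrary, and the helper names (`toy_films`, `toy_summary`) do not exist in the notebook.

```{r}
# Hypothetical toy values, not the notebook's data
toy_films <- data.frame(
    rating = c(92, 90, 87, 47, 52, 49, 35, 44),
    box_office = c(54.7, 33.0, 83.0, 9.7, 1.7, 33.3, 90.8, 186.6)
)

# Same preprocessing idea as the notebook: log10 box office, then scale
toy_features <- scale(transform(toy_films, box_office = log10(box_office)))
toy_hc_2d <- hclust(dist(toy_features), method = "ward.D")

# Assign each toy movie to one of three groups (k = 3 is arbitrary here)
toy_films$group <- cutree(toy_hc_2d, k = 3)

# Mean rating and mean box office per group, plus group size
toy_summary <- aggregate(cbind(rating, box_office) ~ group,
                         data = toy_films, FUN = mean)
toy_summary$n <- as.vector(table(toy_films$group))
toy_summary
```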

```{r}
atribuicoes = get_grupos(agrupamento_h_2d, num_grupos = 1:6)

atribuicoes = atribuicoes %>% 
    filter(k == 5) %>%
    mutate(filme = label) %>% 
    left_join(filmes, by = "filme")

p <- atribuicoes %>%
    ggplot(aes(x = avaliacao,
               y = bilheteria,
               colour = grupo,
               text = paste(
                    "Movie:", filme,
                    "\nBox Office:", bilheteria,"m\n",
                    "Rating:", avaliacao))) + 
    geom_jitter(width = .02, height = 0, size = 3, alpha = .6) + 
    facet_wrap(~ paste(k, " groups")) + 
    scale_color_brewer(palette = "Dark2") +
    scale_y_log10() +
    guides(color=guide_legend(title="group")) +
    labs(y = "Box Office", x = "Rating RT")


ggplotly(p, tooltip = "text") %>%
    layout(autosize = F)
```

<br>

***

<br>

### What are the names of the groups? 

<br>

$\color{#16A085}{\text{Group 1 (Oddball):}}$ _Movies overall well received by critics but not by the wider public_, as reflected in their low revenue. The name Oddball comes from the later interest these movies attract from people who consider themselves eccentric, not rarely to reaffirm their sense of exclusivity.

<br/>

$\color{#CF5300}{\text{Group 2 (Matinee):}}$ Movies overall _not so well received by the critics and more formulaic_. In terms of box office most of them had low revenue, but still paid for themselves. The name Matinee comes from the idea that movies that did not perform that well, or that stopped being a hot topic long ago, end up filling that time slot.

<br/>

$\color{#7C3F7C}{\text{Group 3 (Demolition of a budget):}}$ Movies overall _poorly received by both critics and public_, as reflected in their small Box Office and low ratings. The name of the group is a play on the very small revenue brought in by these movies, which "demolished" the investment of those who bet on them. 

<br/>

$\color{magenta}{\text{Group 4 (Broke Records and Awards):}}$ Movies _acclaimed by critics_ and whose box office was either successful or at least decent. The movies in this group have a more serious tone and deal with serious matters that frequently create controversy (serial murders, non-heterosexuality, terrorism...). The name of the group is a play on the name of one of its movies and the sheer number of prizes that particular movie won. 

<br/>

$\color{green}{\text{Group 5 (BlockBusters):}}$ Movies in which Jake acted that the critics didn't like that much but that collected a huge box office, with revenue on the scale of hundreds of millions. The term BlockBuster is usually given to movies that draw crowds to the movie theaters, which is the case for the movies in this group.

<br>

***

<br/>

### Example movie from each group  

<br/>

$\color{#16A085}{\text{Group 1 (Oddball):}}$

* **Stronger**: Biographical movie about 'Jeff Bauman', a victim of the Boston bombing who lost both legs in the explosion. The movie was very well received by critics, who praised it for being well executed, moving, and focused on a story of overcoming adversity instead of using the tragedy to feed paranoia about terrorism. The movie was, however, a box office failure.

<br/>

$\color{#CF5300}{\text{Group 2 (Matinee):}}$

* **Life**: A space science fiction movie with unremarkable revenue and equally unenthusiastic reviews. Many considered it well executed but not very innovative.

<br/>

$\color{#7C3F7C}{\text{Group 3 (Demolition of a budget):}}$ 

* **Demolition**: In this movie Jake plays a man who returns to work after losing his wife and finds human contact in a call center attendant while complaining about a vending machine. The movie was a failure in terms of revenue as well as reviews. Its script was singled out as the main problem, described as 'trying to affect depth' and uncharismatic.
    
<br/>

$\color{magenta}{\text{Group 4 (Broke Records and Awards):}}$

* **Brokeback Mountain**: Probably Jake Gyllenhaal's best performance to date, this movie earned Jake an Oscar nomination and stirred up a lot of controversy for containing a sex scene between people of the same sex. The Academy (responsible for choosing the Oscar winners) was accused of homophobia for not choosing this movie as the Best Picture winner; even so, Brokeback Mountain won 141 other awards and received 128 nominations according to IMDb. The movie was considered a success in both revenue and ratings.

<br/>

$\color{green}{\text{Group 5 (BlockBusters):}}$ 

* **Prince of Persia: The Sands of Time**: Based on the game of the same name, which many still regard as a benchmark of quality and innovation. The movie drew disappointed comments from both critics and fans, who curiously did not fail to contribute to the movie's revenue.
    
